Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: atrium-repo #272

Open
wants to merge 43 commits into
base: main
Choose a base branch
from
Open

feat: atrium-repo #272

wants to merge 43 commits into from

Conversation

DrChat
Copy link

@DrChat DrChat commented Dec 23, 2024

This builds on str4d's work here: #168.

Big notes:

  • Complete merkle search tree implementation (but still lacking fuzz tests)
  • Complete Repository implementation with support for basic CRUD operations
  • Added a new pair of traits for block storage: AsyncBlockStoreRead and AsyncBlockStoreWrite.
    • In-memory storage backend (primarily intended for testing)
    • CAR storage (supporting both reading and writing)
    • Users can implement other storage sources, like Azure-backed block storage.

  • Merkle Search Tree
    • Parse Node
      • Read Node from bytes
      • Verify depth and sort order
      • Limit the number of TreeEntrys per Node to a statistically unlikely maximum length.
      • Consider limiting the overall depth of the repo, or other parameters, to prevent more sophisticated key mining attacks.
    • Locate key within node
    • Locate keys within node with a given prefix
    • Add entry
    • Edit entry
    • Delete entry
  • Repository
    • Load from a CAR file
    • Load from a firehose record
    • Read/write records inside of a repository
  • Storage
    • CAR files
      • Reading CAR files
      • Verify completeness of the repository structure.
      • Robustness to both duplication and de-duplication of blocks.
      • Ignore any unnecessary or unlinked blocks.
    • Commits from firehose records

@DrChat DrChat marked this pull request as draft December 23, 2024 20:14
@DrChat DrChat force-pushed the repo branch 6 times, most recently from 3b84377 to f8e988e Compare December 26, 2024 21:08
@DrChat DrChat force-pushed the repo branch 2 times, most recently from f736ccc to b230066 Compare January 1, 2025 19:20
loop {
let node = Node::read_from(&mut bs, node_cid).await?;
if !seen.insert(node_cid) {
// This CID was already seen. There is a cycle in the graph.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even possible to serialize a graph with a cycle?
The two nodes would point at each other, and by definition it should be impossible to compute their CIDs if there is a cycle.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I actually don't think this is possible, which means the extra validation logic here is probably optional and very unlikely to be hit in practice.
To err on the side of safety though, I think it would be best to leave this in-place.

@DrChat DrChat force-pushed the repo branch 2 times, most recently from 6db3530 to 60adc4a Compare January 18, 2025 21:58
@DrChat
Copy link
Author

DrChat commented Jan 18, 2025

I think the PR is in a pretty good state for a MVP.
My time is limited so I won't be able to quickly iterate on the functionality for creating new commits for a repository. Still planning on it, but the work here can be pushed up in the meantime :)

@DrChat DrChat marked this pull request as ready for review January 18, 2025 22:24
@erlend-sh
Copy link
Contributor

erlend-sh commented Jan 19, 2025

Pinging @str4d since this builds on their prior work.

@sugyan sugyan self-requested a review January 19, 2025 13:48
Copy link
Owner

@sugyan sugyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have written several comments on areas unrelated to the internal logic. There are also a couple of points from clippy that I hope you can correct as well.

@@ -12,5 +12,8 @@ target/
# These are backup files generated by rustfmt
**/*.rs.bk

# VSCode settings
/.vscode
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want to include anything related to a specific editor.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this line because VSCode annoyingly creates this folder automatically (likely due to the Copilot extension being broken and populating a settings.json).
If it isn't present, it's almost certain that I or someone else may accidentally commit this folder.

If you don't want to have a line that explicitly mentions VSCode, another alternative you could consider is changing this .gitignore file to an allowlist.
Here's an example from one of my other projects.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each individual's settings related to the editor should have their own global ignore settings, and I don't want to include changes on the repository side for that.

According to man gitignore:

Patterns which a user wants Git to ignore in all situations (e.g., backup or temporary files generated by the user’s editor of choice) generally go into a file specified by core.excludesFile in the user’s ~/.gitconfig. Its default value is $XDG_CONFIG_HOME/git/ignore. If $XDG_CONFIG_HOME is either not set or empty, $HOME/.config/git/ignore is used instead.

I believe that editor related ignore settings should be written there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I still persuade you to reconsider? :)
This is commonly done in many projects - for example, Rust itself has a .gitignore file checked in with this line present.

I do work with some projects that are exceptions to this rule and feel like a global setting would be more cumbersome than helpful.

"examples/concurrent",
"examples/firehose",
"examples/video",
# "examples/concurrent",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these are no longer excluded?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I had forgotten about this.
I changed this because the examples were written against the last released version of atrium rather than the version present in the repository.
I needed to update the firehose example to incorporate the changes I have made here, and chose to change its dependencies to point at the crates in this repository.

The decision to have examples demonstrate the last public release seems to be an unusual one. I haven't ever seen this be done for any other Rust library.
Any thoughts about changing this?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. The current policy of using the latest release version does not allow us to use newly added packages.

We've discussed this before and made it the way it is now,
#100 (comment)

But maybe we should reconsider.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed - I think it would be wise to reconsider :)

One potential alternative to allow people to find examples for the previously released version would be to create git tags upon release and instruct users to look for the corresponding tag.
It actually looks like release-plz is already creating these tags for you (though appears to be doing so per-crate).

atrium-repo/Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
[![](https://img.shields.io/crates/v/atrium-repo)](https://crates.io/crates/atrium-repo)
[![](https://img.shields.io/docsrs/atrium-repo)](https://docs.rs/atrium-repo)
[![](https://img.shields.io/crates/l/atrium-repo)](https://github.com/sugyan/atrium/blob/main/LICENSE)
[![Rust](https://github.com/sugyan/atrium/actions/workflows/api.yml/badge.svg?branch=main)](https://github.com/sugyan/atrium/actions/workflows/api.yml)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this package also contains tests, you should also add workflows/repo.yml and this should be a link to it.

@DrChat DrChat force-pushed the repo branch 4 times, most recently from 690f6b1 to 5f55fc5 Compare January 31, 2025 01:03
DrChat added 26 commits February 1, 2025 11:52
Why: This function is a footgun that will likely be misused.
Deleting a block is not typically safe because it could have multiple incoming references (and thus deletion converts those into dangling references).
The only way to safely delete a block would be to wholistically enumerate an archive file and collect a list of all referenced blocks.
@DrChat
Copy link
Author

DrChat commented Feb 8, 2025

Note: It looks like there's a pretty exhaustive MST test suite here that could probably be leveraged to validate this implementation.
https://github.com/DavidBuchanan314/mst-test-suite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants